##Data Analysis #2
## 'data.frame': 1036 obs. of 10 variables:
## $ SEX : Factor w/ 3 levels "F","I","M": 2 2 2 2 2 2 2 2 2 2 ...
## $ LENGTH: num 5.57 3.67 10.08 4.09 6.93 ...
## $ DIAM : num 4.09 2.62 7.35 3.15 4.83 ...
## $ HEIGHT: num 1.26 0.84 2.205 0.945 1.785 ...
## $ WHOLE : num 11.5 3.5 79.38 4.69 21.19 ...
## $ SHUCK : num 4.31 1.19 44 2.25 9.88 ...
## $ RINGS : int 6 4 6 3 6 6 5 6 5 6 ...
## $ CLASS : Factor w/ 5 levels "A1","A2","A3",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ VOLUME: num 28.7 8.1 163.4 12.2 59.7 ...
## $ RATIO : num 0.15 0.147 0.269 0.185 0.165 ...
#### Section 1: (5 points) ####
(1)(a) Form a histogram and QQ plot using RATIO. Calculate skewness and kurtosis using ‘rockchalk.’ Be aware that with ‘rockchalk’, the kurtosis value has 3.0 subtracted from it which differs from the ‘moments’ package.
## Numeric variables
## mydata$RATIO
## min 0.0673
## med 0.1391
## max 0.3118
## mean 0.1420
## sd 0.0293
## skewness 0.7147
## kurtosis 1.6673
## nobs 1036
## nmissing 0
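The distinction between "raw" and "excess" kurtosis mentioned in (1)(a) can be illustrated with a minimal base-R sketch. The helper names (`central_moment`, `skew`, `kurt_raw`, `kurt_excess`) and the simulated vector `x` are hypothetical and are not part of the assignment code. The formulas are the simple moment estimators, so package functions that apply small-sample corrections may return slightly different values.

```r
# Moment-based skewness and kurtosis in base R (no packages needed).
# Hypothetical helpers for illustration only.
central_moment <- function(x, k) mean((x - mean(x))^k)

skew <- function(x) central_moment(x, 3) / central_moment(x, 2)^1.5

# "Raw" kurtosis: equals 3 for a normal distribution.
kurt_raw <- function(x) central_moment(x, 4) / central_moment(x, 2)^2

# Excess kurtosis: raw kurtosis minus 3, so a normal distribution gives 0.
kurt_excess <- function(x) kurt_raw(x) - 3

set.seed(1)
x <- rlnorm(1000, meanlog = -2, sdlog = 0.2)  # simulated right-skewed data
c(skewness = skew(x), raw = kurt_raw(x), excess = kurt_excess(x))
```

When comparing output from two packages, a difference of exactly 3 in the kurtosis values usually means one reports the raw value and the other the excess value.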
(1)(b) Transform RATIO using log10() to create L_RATIO (Kabacoff Section 8.5.2, p. 199-200). Form a histogram and QQ plot using L_RATIO. Calculate the skewness and kurtosis. Create a boxplot of L_RATIO differentiated by CLASS.
## Numeric variables
## mydata$L_RATIO
## min -1.1717
## med -0.8565
## max -0.5062
## mean -0.8566
## sd 0.0890
## skewness -0.0939
## kurtosis 0.5354
## nobs 1036
## nmissing 0
(1)(c) Test the homogeneity of variance across classes using bartlett.test() (Kabacoff Section 9.2.2, p. 222).
##
## Bartlett test of homogeneity of variances
##
## data: L_RATIO by CLASS
## Bartlett's K-squared = 3.1891, df = 4, p-value = 0.5267
##
## Bartlett test of homogeneity of variances
##
## data: RATIO by CLASS
## Bartlett's K-squared = 21.49, df = 4, p-value = 0.0002531
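The comparison in (1)(c) can be sketched on simulated data. Everything below (the data frame `df`, the grouping factor `g`, the group means) is a hypothetical stand-in for the abalone data. The groups are constructed to have equal variance on the log10 scale, which means the spread on the raw scale grows with the group mean.

```r
set.seed(42)
# Simulated stand-in data: five groups, equal variance on the log10 scale
g     <- factor(rep(paste0("G", 1:5), each = 200))
mu    <- rep(c(-0.80, -0.82, -0.85, -0.88, -0.92), each = 200)
log_y <- rnorm(1000, mean = mu, sd = 0.09)
df    <- data.frame(g = g, y = 10^log_y, log_y = log_y)

# Bartlett's test: the null hypothesis is equal variances across groups
raw_test <- bartlett.test(y ~ g, data = df)
log_test <- bartlett.test(log_y ~ g, data = df)

c(raw = raw_test$p.value, log = log_test$p.value)
```

A small p-value is evidence against equal variances. A large p-value does not prove equality; it only means the test found no evidence against it.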
Essay Question: Based on steps 1.a, 1.b and 1.c, which variable RATIO or L_RATIO exhibits better conformance to a normal distribution with homogeneous variances across age classes? Why?
Answer: L_RATIO exhibits better conformance to a normal distribution than RATIO. The log10 transformation reduces skewness from 0.71 to about -0.09, which is close to the zero skewness of a symmetric distribution. The QQ plot of L_RATIO shows points more closely aligned with the diagonal, and its histogram is more bell-shaped than that of RATIO. Both variables still have positive excess kurtosis, meaning heavier tails than the normal distribution, but the value for L_RATIO (0.54) is much milder than for RATIO (1.67), indicating fewer extreme deviations. Bartlett’s test also favors L_RATIO. For RATIO, p = 0.00025, so the null hypothesis of equal variances across age classes is rejected at the 0.05 level. For L_RATIO, p = 0.527, well above 0.05, so we fail to reject the null hypothesis and the data are consistent with homogeneous variances across age classes. Taken together, these results suggest that L_RATIO better satisfies the assumptions of normality and variance homogeneity required for many statistical analyses.
#### Section 2 (10 points) ####
(2)(a) Perform an analysis of variance with aov() on L_RATIO using CLASS and SEX as the independent variables (Kabacoff chapter 9, p. 212-229). Assume equal variances. Perform two analyses. First, fit a model with the interaction term CLASS:SEX. Then, fit a model without CLASS:SEX. Use summary() to obtain the analysis of variance tables (Kabacoff chapter 9, p. 227).
## Df Sum Sq Mean Sq F value Pr(>F)
## CLASS 4 1.055 0.26384 38.370 < 2e-16 ***
## SEX 2 0.091 0.04569 6.644 0.00136 **
## CLASS:SEX 8 0.027 0.00334 0.485 0.86709
## Residuals 1021 7.021 0.00688
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Df Sum Sq Mean Sq F value Pr(>F)
## CLASS 4 1.055 0.26384 38.524 < 2e-16 ***
## SEX 2 0.091 0.04569 6.671 0.00132 **
## Residuals 1029 7.047 0.00685
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
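The two fits in (2)(a) can also be compared formally as nested models. The sketch below uses simulated data with hypothetical names (`d`, `cls`, `sex`, `y`) and an additive data-generating process. It is not the assignment code; it only illustrates the pattern of fitting both models and testing whether the interaction adds anything.

```r
set.seed(7)
n <- 600
d <- data.frame(
  cls = factor(sample(paste0("A", 1:5), n, replace = TRUE)),
  sex = factor(sample(c("F", "I", "M"), n, replace = TRUE))
)
# Simulated response: class and sex shift the mean, with no interaction
d$y <- -0.8 - 0.02 * as.integer(d$cls) - 0.015 * (d$sex == "I") +
  rnorm(n, sd = 0.08)

fit_int <- aov(y ~ cls * sex, data = d)  # main effects plus cls:sex
fit_add <- aov(y ~ cls + sex, data = d)  # main effects only

summary(fit_int)
summary(fit_add)

# F test of the nested models: does adding cls:sex reduce residual variance?
cmp <- anova(fit_add, fit_int)
cmp
```

The second row of `cmp` tests the same 8 interaction degrees of freedom that appear in the CLASS:SEX line of the first table.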
Essay Question: Compare the two analyses. What does the non-significant interaction term suggest about the relationship between L_RATIO and the factors CLASS and SEX?
Answer: The two analyses give very similar results. In both models, CLASS (p < 2e-16) and SEX (p ≈ 0.0013) are statistically significant predictors of L_RATIO, indicating that both age class and sex independently influence variation in L_RATIO. The interaction term CLASS:SEX, however, is not statistically significant (p ≈ 0.867), and dropping it raises the residual sum of squares only from 7.021 to 7.047. This suggests that the effect of CLASS on L_RATIO does not depend on SEX, and likewise, the effect of SEX does not vary across CLASS. In other words, the relationship between L_RATIO and CLASS is consistent for all levels of SEX, and the relationship between L_RATIO and SEX is consistent across classes. Because the interaction is non-significant, the simpler additive model, CLASS + SEX, is preferred.
(2)(b) For the model without CLASS:SEX (i.e. an interaction term), obtain multiple comparisons with the TukeyHSD() function. Interpret the results at the 95% confidence level (TukeyHSD() will adjust for unequal sample sizes).
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = L_RATIO ~ CLASS + SEX, data = mydata)
##
## $CLASS
## diff lwr upr p adj
## A2-A1 -0.01248831 -0.03876038 0.013783756 0.6919456
## A3-A1 -0.03426008 -0.05933928 -0.009180867 0.0018630
## A4-A1 -0.05863763 -0.08594237 -0.031332896 0.0000001
## A5-A1 -0.09997200 -0.12764430 -0.072299703 0.0000000
## A3-A2 -0.02177176 -0.04106269 -0.002480831 0.0178413
## A4-A2 -0.04614932 -0.06825638 -0.024042262 0.0000002
## A5-A2 -0.08748369 -0.11004316 -0.064924223 0.0000000
## A4-A3 -0.02437756 -0.04505283 -0.003702280 0.0114638
## A5-A3 -0.06571193 -0.08687025 -0.044553605 0.0000000
## A5-A4 -0.04133437 -0.06508845 -0.017580286 0.0000223
##
## $SEX
## diff lwr upr p adj
## I-F -0.015890329 -0.031069561 -0.0007110968 0.0376673
## M-F 0.002069057 -0.012585555 0.0167236690 0.9412689
## M-I 0.017959386 0.003340824 0.0325779478 0.0111881
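A Tukey table is read by checking whether each interval (lwr, upr) excludes zero, which is equivalent to an adjusted p-value below 0.05 at the 95% family-wise level. The sketch below shows this on simulated data; the names `d`, `grp`, `y` and `excludes_zero` are hypothetical, and the group sizes are deliberately unequal.

```r
set.seed(11)
# Simulated stand-in data with unequal group sizes
d <- data.frame(grp = factor(rep(c("F", "I", "M"), times = c(150, 120, 180))))
d$y <- -0.85 - 0.02 * (d$grp == "I") + rnorm(nrow(d), sd = 0.09)

fit <- aov(y ~ grp, data = d)
tk  <- TukeyHSD(fit, "grp", conf.level = 0.95)
res <- as.data.frame(tk$grp)

# A pair differs at the 95% family-wise level when its interval excludes zero
res$excludes_zero <- res$lwr > 0 | res$upr < 0
res
```

The sign of `diff` gives the direction: a row labeled `I-F` with a negative `diff` means the I mean is below the F mean.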
Additional Essay Question: first, interpret the trend in coefficients across age classes. What is this indicating about L_RATIO? Second, do these results suggest male and female abalones can be combined into a single category labeled as ‘adults?’ If not, why not?
Answer: The TukeyHSD results show that mean L_RATIO declines steadily across age classes: every difference between an older and a younger class is negative, and all are significant at the 95% family-wise level except A2-A1 (p adj = 0.69). The largest gap is A5-A1 (-0.100). This indicates that younger abalones have proportionally higher shuck-to-volume ratios than older ones and confirms that age strongly influences L_RATIO. Regarding SEX, males and females show no significant difference in mean L_RATIO (p adj = 0.94), suggesting they can be combined into a single adult category. Infants, however, differ significantly from both females (p adj = 0.038) and males (p adj = 0.011), so they should remain distinct.
#### Section 3: (10 points) ####
(3)(a1) Here, we will combine “M” and “F” into a new level, “ADULT”. The code for doing this is given to you. For (3)(a1), all you need to do is execute the code as given.
##
## ADULT I
## 707 329
(3)(a2) Present side-by-side histograms of VOLUME. One should display infant volumes and, the other, adult volumes.
Essay Question: Compare the histograms. How do the distributions differ? Are there going to be any difficulties separating infants from adults based on VOLUME?
Answer: Volume is a useful but imperfect separator. While it shows distinct tendencies (infants smaller, adults larger), the overlap zone means you’d misclassify some abalones if you relied on volume alone. Additional variables might improve accuracy.
(3)(b) Create a scatterplot of SHUCK versus VOLUME and a scatterplot of their base ten logarithms, labeling the variables as L_SHUCK and L_VOLUME. Please be aware the variables, L_SHUCK and L_VOLUME, present the data as orders of magnitude (i.e. VOLUME = 100 = 10^2 becomes L_VOLUME = 2). Use color to differentiate CLASS in the plots. Repeat using color to differentiate by TYPE.
Additional Essay Question: Compare the two scatterplots. What effect(s) does log-transformation appear to have on the variability present in the plot? What are the implications for linear regression analysis? Where do the various CLASS levels appear in the plots? Where do the levels of TYPE appear in the plots?
Answer: The log-transformation of SHUCK and VOLUME substantially reduces the heteroscedasticity (unequal spread) that is present in the raw scatterplots. In the raw plots, the variance in SHUCK increases with larger VOLUME values, producing a “fan-shaped” spread. After the log10 transformation, the data cloud becomes more evenly distributed across the range, producing a straighter linear trend with more constant variance.
Implications for regression analysis: In the raw data, heteroscedasticity would violate the assumptions of ordinary least squares regression, potentially leading to inefficient estimates and biased inference. After log-transformation, the residual variance is stabilized, making linear regression assumptions more appropriate. This allows for more reliable hypothesis tests and confidence intervals.
Location of CLASS levels: In the raw plots colored by CLASS, younger classes (A1, A2) cluster toward the lower-left corner, with smaller SHUCK and VOLUME values. Older classes (A4, A5) extend toward the upper-right, reflecting larger body sizes. After log-transformation, these classes remain ordered along the regression line.
Location of TYPE levels: In the TYPE plots, infants (I) are located toward the lower-left, while adults occupy the majority of the upper and central portions of the plot. After log transformation, the separation between infants and adults is clearer: infants cluster tightly at lower log values, while adults span a broader range.
Conclusion: The log-transformation improves model fit by stabilizing variance and highlighting proportional differences. It clarifies the relative positions of both CLASS and TYPE, making regression analysis more interpretable.
#### Section 4: (5 points) ####
(4)(a1) Since abalone growth slows after class A3, infants in classes A4 and A5 are considered mature and candidates for harvest. You are given code in (4)(a1) to reclassify the infants in classes A4 and A5 as ADULTS.
##
## ADULT I
## 747 289
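The supplied recoding code for (3)(a1) and (4)(a1) is not shown in this output. One way such a recoding could be written is sketched below on a six-row toy data frame; the exact statements are an assumption, and only the column names SEX, CLASS and TYPE come from the assignment text.

```r
# Toy data: the recoding logic is an assumed illustration, not the given code
d <- data.frame(
  SEX   = factor(c("I", "I", "M", "F", "I", "F")),
  CLASS = factor(c("A1", "A4", "A3", "A5", "A5", "A2"),
                 levels = paste0("A", 1:5))
)

# Step 1, as in (3)(a1): males and females become "ADULT", infants stay "I"
d$TYPE <- factor(ifelse(d$SEX == "I", "I", "ADULT"))

# Step 2, as in (4)(a1): infants in classes A4 or A5 are treated as adults
mature <- d$TYPE == "I" & d$CLASS %in% c("A4", "A5")
d$TYPE[mature] <- "ADULT"

table(d$TYPE)
```

In the reported counts, the infant group shrinks from 329 to 289 and the adult group grows from 707 to 747, so 40 individuals were reclassified.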
(4)(a2) Regress L_SHUCK as the dependent variable on L_VOLUME, CLASS and TYPE (Kabacoff Section 8.2.4, p. 178-186, the Data Analysis Video #2 and Black Section 14.2). Use the multiple regression model: L_SHUCK ~ L_VOLUME + CLASS + TYPE. Apply summary() to the model object to produce results.
##
## Call:
## lm(formula = L_SHUCK ~ L_VOLUME + CLASS + TYPE, data = mydata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.270634 -0.054287 0.000159 0.055986 0.309718
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.796418 0.021718 -36.672 < 2e-16 ***
## L_VOLUME 0.999303 0.010262 97.377 < 2e-16 ***
## CLASSA2 -0.018005 0.011005 -1.636 0.102124
## CLASSA3 -0.047310 0.012474 -3.793 0.000158 ***
## CLASSA4 -0.075782 0.014056 -5.391 8.67e-08 ***
## CLASSA5 -0.117119 0.014131 -8.288 3.56e-16 ***
## TYPEI -0.021093 0.007688 -2.744 0.006180 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.08297 on 1029 degrees of freedom
## Multiple R-squared: 0.9504, Adjusted R-squared: 0.9501
## F-statistic: 3287 on 6 and 1029 DF, p-value: < 2.2e-16
Essay Question: Interpret the trend in CLASS level coefficient estimates. (Hint: this question is not asking if the estimates are statistically significant. It is asking for an interpretation of the pattern in these coefficients, and how this pattern relates to the earlier displays).
Answer: The coefficients for CLASS relative to the baseline group, A1, show a clear, downward trend: A2 at –0.018, A3 at –0.047, A4 at –0.076, and A5 at –0.117. This pattern means that, after adjusting for L_VOLUME and TYPE, abalones in higher CLASS groups (older aged) tend to have progressively lower values of L_SHUCK relative to their VOLUME. As CLASS increases from A1 to A5, the ratio of SHUCK to VOLUME decreases on the log scale. This indicates that, even when comparing abalones of the same volume, older classes return proportionally smaller shuck weights. In earlier histograms and boxplots of RATIO and L_RATIO by CLASS, we saw that older age classes clustered lower, indicating a declining trend in the shuck-to-volume ratio. The regression coefficients confirm this same decline, but now within a model that accounts for volume scaling (via log transformation) and TYPE. The regression result quantifies what was visible in the exploratory plots: age class differences are not random, but instead follow a consistent downward slope across classes. The CLASS coefficient pattern reflects a trend: as abalones age, their meat yield relative to size diminishes. This pattern, already visible in the visual displays, is now captured numerically and confirmed through regression.
Additional Essay Question: Is TYPE an important predictor in this regression? (Hint: This question is not asking if TYPE is statistically significant, but rather how it compares to the other independent variables in terms of its contribution to predictions of L_SHUCK for harvesting decisions.) Explain your conclusion.
Answer: In the regression, TYPE (Infant vs. Adult) has a coefficient of about –0.021 for Infant (I) compared to Adults, with a relatively small magnitude compared to the effects of CLASS and L_VOLUME. L_VOLUME is by far the dominant predictor: its coefficient is ~1.00, and it explains nearly all of the variability in L_SHUCK. This makes sense, since shuck weight scales almost directly with volume. CLASS effects are larger than TYPE’s, ranging from –0.018 (A2) down to –0.117 (A5). These differences accumulate across age groups and reflect a clear trend that affects yield. TYPE does adjust predictions slightly (Infants have proportionally less shuck weight relative to adults of the same volume), but its effect size is modest compared to L_VOLUME and CLASS. If the goal is to predict absolute shuck yield, TYPE contributes only a minor refinement to predictions beyond volume and age class. Volume and class information provide the most useful predictors for harvesting, since they capture the bulk of the variability and the systematic decline with age. TYPE could still be useful as a secondary adjustment factor, but on its own it would not be sufficient to guide harvesting decisions.
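Because the response is on the log10 scale, each CLASS or TYPE coefficient is an additive shift in L_SHUCK, which corresponds to a multiplicative factor of 10^b in SHUCK at fixed volume. The short calculation below back-transforms the estimates copied from the summary() output above. The variable names are hypothetical.

```r
# CLASS coefficients copied from the regression summary (baseline A1 = 0)
class_coef <- c(A1 = 0, A2 = -0.018005, A3 = -0.047310,
                A4 = -0.075782, A5 = -0.117119)

# A shift of b on the log10 scale is a factor of 10^b on the original scale,
# here expressed relative to class A1 at the same volume and TYPE
shuck_factor <- 10^class_coef
round(shuck_factor, 3)

# Same idea for TYPE: infants relative to adults at equal volume and class
type_factor <- 10^(-0.021093)
round(type_factor, 3)
```

Read this way, an A5 abalone is predicted to yield roughly three quarters of the shuck weight of an A1 abalone of the same volume, while the infant adjustment is only a few percent.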
The next two analysis steps involve an analysis of the residuals resulting from the regression model in (4)(a) (Kabacoff Section 8.2.4, p. 178-186, the Data Analysis Video #2).
#### Section 5: (5 points) ####
(5)(a) If “model” is the regression object, use model$residuals and construct a histogram and QQ plot. Compute the skewness and kurtosis. Be aware that with ‘rockchalk,’ the kurtosis value has 3.0 subtracted from it which differs from the ‘moments’ package.
## Skewness: -0.0595
## Kurtosis: 0.3433
(5)(b) Plot the residuals versus L_VOLUME, coloring the data points by CLASS and, a second time, coloring the data points by TYPE. Keep in mind the y-axis and x-axis may be disproportionate which will amplify the variability in the residuals. Present boxplots of the residuals differentiated by CLASS and TYPE. (These four plots can be conveniently presented on one page using par(mfrow..) or grid.arrange().) Test the homogeneity of variance of the residuals across classes using bartlett.test() (Kabacoff Section 9.3.2, p. 222).
##
## Bartlett test of homogeneity of variances
##
## data: resid by CLASS
## Bartlett's K-squared = 3.6882, df = 4, p-value = 0.4498
Essay Question: What is revealed by the displays and calculations in (5)(a) and (5)(b)? Does the model ‘fit’? Does this analysis indicate that L_VOLUME, and ultimately VOLUME, might be useful for harvesting decisions? Discuss.
Answer: The displays and calculations suggest that the regression model fits the data very well. The histogram of residuals is approximately bell-shaped, and the QQ plot shows only minor deviations in the tails, so some caution is warranted when extrapolating to extreme values. The skewness (-0.0595) is very close to 0, and the excess kurtosis (0.3433) is only slightly above 0, indicating near-normal residuals with mildly heavy tails. This supports the assumption of normally distributed errors. Both plots of residuals versus L_VOLUME (colored by CLASS and by TYPE) show residuals evenly spread around zero with no strong funnel shape. The Bartlett test across CLASS (p = 0.4498) fails to reject equal variances, so the residual spread is consistent across classes. The residual standard error is low (0.083 on the log10 scale), and the R² reported earlier (~0.95) indicates the model explains most of the variability in L_SHUCK. This suggests that the model predictions are reliable. Since L_VOLUME is the main predictor and the residuals are well-behaved, VOLUME, and its log transformation, appears to be a useful and stable predictor of SHUCK weight. This means managers can use measurements of volume to make informed decisions about abalone harvesting. The stability across CLASS and TYPE strengthens confidence that predictions are not strongly biased by age.
Harvest Strategy:
There is a tradeoff faced in managing abalone harvest. The infant population must be protected since it represents future harvests. On the other hand, the harvest should be designed to be efficient with a yield to justify the effort. This assignment will use VOLUME to form binary decision rules to guide harvesting. If VOLUME is below a “cutoff” (i.e. a specified volume), that individual will not be harvested. If above, it will be harvested. Different rules are possible. Management needs to decide on one rule to implement that meets the business goal.
The next steps in the assignment will require consideration of the proportions of infants and adults harvested at different cutoffs. For this, similar “for-loops” will be used to compute the harvest proportions. These loops must use the same values for the constants min.v and delta and use the same statement “for(k in 1:10000).” Otherwise, the resulting infant and adult proportions cannot be directly compared and plotted as requested. Note the example code supplied below.
#### Section 6: (5 points) ####
(6)(a) A series of volumes covering the range from minimum to maximum abalone volume will be used in a “for loop” to determine how the harvest proportions change as the “cutoff” changes. Code for doing this is provided.
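The supplied loop is not reproduced in this output. The sketch below shows one plausible form of it on simulated volumes. The names min.v, delta, volume.value, prop.infants and prop.adults come from the assignment text, but their exact definitions here are assumptions, as are the simulated data and the helper names `vol`, `type`, `idxi` and `idxa`.

```r
set.seed(3)
# Simulated stand-in volumes; infants tend to be smaller than adults
vol  <- c(rlnorm(300, log(130), 0.7), rlnorm(700, log(380), 0.5))
type <- rep(c("I", "ADULT"), times = c(300, 700))
idxi <- type == "I"
idxa <- type == "ADULT"

# Assumed definitions: 10,000 equally spaced cutoffs from min to max volume
max.v <- max(vol)
min.v <- min(vol)
delta <- (max.v - min.v) / 10000

prop.infants <- numeric(10000)
prop.adults  <- numeric(10000)
volume.value <- numeric(10000)

for (k in 1:10000) {
  value <- min.v + k * delta
  volume.value[k] <- value
  # Proportion of each group at or below the cutoff, i.e. NOT harvested
  prop.infants[k] <- mean(vol[idxi] <= value)
  prop.adults[k]  <- mean(vol[idxa] <= value)
}

head(cbind(volume.value, prop.infants, prop.adults))
```

Keeping min.v, delta and the loop range identical across all later steps ensures that index k refers to the same cutoff in every vector.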
(6)(b) Our first “rule” will be protection of all infants. We want to find a volume cutoff that protects all infants, but gives us the largest possible harvest of adults. We can achieve this by using the volume of the largest infant as our cutoff. You are given code below to identify the largest infant VOLUME and to return the proportion of adults harvested by using this cutoff. You will need to modify this latter code to return the proportion of infants harvested using this cutoff. Remember that we will harvest any individual with VOLUME greater than our cutoff.
## [1] 526.6383
## [1] 0.2476573
## [1] 0
(6)(c) Our next approaches will look at what happens when we use the median infant and adult harvest VOLUMEs. Using the median VOLUMEs as our cutoffs will give us (roughly) 50% harvests. We need to identify the median volumes and calculate the resulting infant and adult harvest proportions for both.
## [1] 133.8214
## [1] 0.4982699
## [1] 0.9330656
## [1] 384.5584
## [1] 0.02422145
## [1] 0.4993307
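The three rules in (6)(b) and (6)(c) all reduce to the same calculation: choose a cutoff, then count the share of each group with volume strictly above it. A minimal sketch on simulated data follows; the function `harvest` and the vectors `vol` and `type` are hypothetical names, not the assignment code.

```r
set.seed(3)
# Simulated stand-in volumes for 300 infants and 700 adults
vol  <- c(rlnorm(300, log(130), 0.7), rlnorm(700, log(380), 0.5))
type <- rep(c("I", "ADULT"), times = c(300, 700))

# Proportion of each group harvested (volume strictly above the cutoff)
harvest <- function(cutoff) {
  c(cutoff  = cutoff,
    infants = mean(vol[type == "I"] > cutoff),
    adults  = mean(vol[type == "ADULT"] > cutoff))
}

rules <- rbind(
  protect_all_infants = harvest(max(vol[type == "I"])),
  median_infant       = harvest(median(vol[type == "I"])),
  median_adult        = harvest(median(vol[type == "ADULT"]))
)
round(rules, 4)
```

In the reported output, the infant proportion at the infant median is 0.4982699 rather than exactly 0.5. This is consistent with an odd group size (144/289): the median is itself an observation, and a strict `>` comparison does not harvest it.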
(6)(d) Next, we will create a plot showing the infant conserved proportions (i.e. “not harvested,” the prop.infants vector) and the adult conserved proportions (i.e. prop.adults) as functions of volume.value. We will add vertical A-B lines and text annotations for the three (3) “rules” considered, thus far: “protect all infants,” “median infant” and “median adult.” Your plot will have two (2) curves - one (1) representing infant and one (1) representing adult proportions as functions of volume.value - and three (3) A-B lines representing the cutoffs determined in (6)(b) and (6)(c).
Essay Question: The two 50% “median” values serve a descriptive purpose illustrating the difference between the populations. What do these values suggest regarding possible cutoffs for harvesting?
Answer: The two 50% “median” values are descriptive benchmarks that highlight the separation between infant and adult populations by volume. The median infant volume suggests that if this value were chosen as a cutoff, approximately half of the infants and over 90% of the adults would be harvested. This indicates that infant and adult volume distributions overlap considerably. The median adult volume, in contrast, results in harvesting about half the adults while conserving nearly all infants. Together, these values suggest that infants tend to occupy the lower range of volumes and adults the higher, but the overlap makes it impossible to set a cutoff that both fully protects infants and still allows for a large harvest of adults. Instead, managers must balance conservation and harvest goals, with the “protect all infants” rule being the most conservative and the “median cutoffs” illustrating how trade-offs look when each group’s central tendency is used.
More harvest strategies:
This part will address the determination of a cutoff volume.value corresponding to the observed maximum difference in harvest percentages of adults and infants. In other words, we want to find the volume value such that the vertical distance between the infant curve and the adult curve is maximum. To calculate this result, the vectors of proportions from item (6) must be used. These proportions must be converted from “not harvested” to “harvested” proportions by using (1 - prop.infants) for infants, and (1 - prop.adults) for adults. The reason the proportion for infants drops sooner than adults is that infants are maturing and becoming adults with larger volumes.
Note on ROC:
There are multiple packages that have been developed to create ROC curves. However, these packages, and the functions they define, expect to see predicted and observed classification vectors. Then, from those predictions, those functions calculate the true positive rates (TPR), false positive rates (FPR) and other classification performance metrics. They are worthwhile, and you will certainly encounter them if you work in R on classification problems. However, in this case, we already have vectors with the TPRs and FPRs. Our adult harvest proportion vector, (1 - prop.adults), is our TPR. This is the proportion, at each possible ‘rule,’ at each hypothetical harvest threshold (i.e. element of volume.value), of individuals we will correctly identify as adults and harvest. Our FPR is the infant harvest proportion vector, (1 - prop.infants): at each possible harvest threshold, what is the proportion of infants we will mistakenly harvest? If we treat “this individual is an infant” as the null hypothesis, the FPR plays the role of the probability of a Type I error, and the TPR plays the role of power (i.e. 1 - probability of a Type II error). Our ROC curve, then, is created by plotting (1 - prop.adults) as a function of (1 - prop.infants). In short, how much more ‘right’ can we be (moving upward on the y-axis) if we’re willing to be increasingly wrong, i.e. harvest some proportion of infants (moving right on the x-axis)?
#### Section 7: (10 points) ####
(7)(a) Evaluate a plot of the difference ((1 - prop.adults) - (1 - prop.infants)) versus volume.value. Compare to the 50% “split” points determined in (6)(c). There is considerable variability present in the peak area of this plot. The observed “peak” difference may not be the best representation of the data. One solution is to smooth the data to determine a more representative estimate of the maximum difference.
(7)(b) Since curve smoothing is not studied in this course, code is supplied below. Execute the following code to create a smoothed curve to append to the plot in (a). The procedure is to individually smooth (1-prop.adults) and (1-prop.infants) before determining an estimate of the maximum difference.
(7)(c) Present a plot of the difference ((1 - prop.adults) - (1 - prop.infants)) versus volume.value with the variable smooth.difference superimposed. Determine the volume.value corresponding to the maximum smoothed difference (Hint: use which.max()). Show the estimated peak location corresponding to the cutoff determined.
Include, side-by-side, the plot from (6)(d) but with a fourth vertical A-B line added. That line should intercept the x-axis at the “max difference” volume determined from the smoothed curve here.
## [1] 262.1689
(7)(d) What separate harvest proportions for infants and adults would result if this cutoff is used? Show the separate harvest proportions. We will actually calculate these proportions in two ways: first, by ‘indexing’ and returning the appropriate element of the (1 - prop.adults) and (1 - prop.infants) vectors, and second, by simply counting the number of adults and infants with VOLUME greater than the volume threshold of interest.
Code for calculating the adult harvest proportion using both approaches is provided.
## [1] 0.7416332
## [1] 0.7416332
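Steps (7)(b) through (7)(d) can be sketched end to end on simulated data. The supplied smoothing code is not shown in this output, so the use of loess() with span = 0.25 is an assumption, as are the simulated volumes and a coarser grid of 1,000 cutoffs built with ecdf() instead of the loop. The names volume.value, prop.infants, prop.adults and smooth.difference follow the assignment text.

```r
set.seed(3)
vol  <- c(rlnorm(300, log(130), 0.7), rlnorm(700, log(380), 0.5))
type <- rep(c("I", "ADULT"), times = c(300, 700))

volume.value <- seq(min(vol), max(vol), length.out = 1000)
# ecdf() gives the proportion at or below each cutoff (not harvested)
prop.infants <- ecdf(vol[type == "I"])(volume.value)
prop.adults  <- ecdf(vol[type == "ADULT"])(volume.value)

harv.a <- 1 - prop.adults    # adult harvest proportion
harv.i <- 1 - prop.infants   # infant harvest proportion
difference <- harv.a - harv.i

# Assumed smoother: smooth each curve separately, then take the difference
fit.a <- loess(harv.a ~ volume.value, span = 0.25)
fit.i <- loess(harv.i ~ volume.value, span = 0.25)
smooth.difference <- predict(fit.a) - predict(fit.i)

idx    <- which.max(smooth.difference)
cutoff <- volume.value[idx]

# Two ways to get the adult harvest proportion at this cutoff
adult_by_index <- harv.a[idx]
adult_by_count <- sum(vol[type == "ADULT"] > cutoff) / sum(type == "ADULT")
c(cutoff = cutoff, index = adult_by_index, count = adult_by_count)
```

The two approaches agree because each element of `harv.a` is, by construction, the share of adults with volume above the corresponding element of `volume.value`.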
There are alternative ways to determine cutoffs. Two such cutoffs are described below.
#### Section 8: (10 points) ####
(8)(a) Harvesting of infants in CLASS “A1” must be minimized. The smallest volume.value cutoff that produces a zero harvest of infants from CLASS “A1” may be used as a baseline for comparison with larger cutoffs. Any smaller cutoff would result in harvesting infants from CLASS “A1.”
Compute this cutoff, and the proportions of infants and adults with VOLUME exceeding this cutoff. Code for determining this cutoff is provided. Show these proportions. You may use either the ‘indexing’ or ‘count’ approach, or both.
(8)(b) Next, append one (1) more vertical A-B line to our (6)(d) graph. This time, showing the “zero A1 infants” cutoff from (8)(a). This graph should now have five (5) A-B lines: “protect all infants,” “median infant,” “median adult,” “max difference” and “zero A1 infants.”
#### Section 9: (5 points) ####
(9)(a) Construct an ROC curve by plotting (1 - prop.adults) versus (1 - prop.infants). Each point which appears corresponds to a particular volume.value. Show the location of the cutoffs determined in (6), (7) and (8) on this plot and label each.
(9)(b) Numerically integrate the area under the ROC curve and report your result. This is most easily done with the auc() function from the “flux” package. Areas-under-curve, or AUCs, greater than 0.8 are taken to indicate good discrimination potential.
## [1] 0.8666917
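If the “flux” package is not available, the area under the ROC curve can be approximated with the trapezoidal rule in base R. This is a substitute technique, not the auc() function named above, and the simulated data and variable names below are hypothetical.

```r
set.seed(3)
vol  <- c(rlnorm(300, log(130), 0.7), rlnorm(700, log(380), 0.5))
type <- rep(c("I", "ADULT"), times = c(300, 700))

volume.value <- seq(min(vol), max(vol), length.out = 1000)
tpr <- 1 - ecdf(vol[type == "ADULT"])(volume.value)  # adults harvested
fpr <- 1 - ecdf(vol[type == "I"])(volume.value)      # infants harvested

# Sort by FPR so x is increasing, and pin the curve to (0,0) and (1,1)
ord <- order(fpr, tpr)
x <- c(0, fpr[ord], 1)
y <- c(0, tpr[ord], 1)

# Trapezoidal rule: sum of width times average height over each segment
auc_trap <- sum(diff(x) * (head(y, -1) + tail(y, -1)) / 2)
auc_trap

plot(x, y, type = "l", xlab = "1 - prop.infants (FPR)",
     ylab = "1 - prop.adults (TPR)", main = "ROC curve (simulated data)")
abline(0, 1, lty = 2)  # reference line for a rule with no discrimination
```

An area of 0.5 corresponds to the dashed diagonal, and an area of 1 corresponds to perfect separation of the two groups by volume.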
#### Section 10: (10 points) ####
(10)(a) Prepare a table showing each cutoff along with the following: 1) true positive rate (1 - prop.adults), 2) false positive rate (1 - prop.infants), 3) harvest proportion of the total population.
To calculate the total harvest proportions, you can use the ‘count’ approach, but ignoring TYPE; simply count the number of individuals (i.e. rows) with VOLUME greater than a given threshold and divide by the total number of individuals in our dataset.
## Cutoff TPR FPR TotalHarvest
## Protect all infants 526.64 0.2477 0.0000 0.1786
## Median infant 133.82 0.9331 0.4983 0.8118
## Median adult 384.56 0.4993 0.0242 0.3668
## Max smoothed diff 262.17 0.7416 0.1765 0.5840
## Zero A1 infants 206.81 0.8260 0.2872 0.6757
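The TotalHarvest column can be cross-checked from the other two columns, since the total harvest is the count-weighted average of the adult and infant harvest proportions. The sketch below uses the rounded values from the table above and the group sizes reported in (4)(a1), 747 adults and 289 infants. The data frame and column names are hypothetical.

```r
n_adult  <- 747
n_infant <- 289

tab <- data.frame(
  rule = c("Protect all infants", "Median infant", "Median adult",
           "Max smoothed diff", "Zero A1 infants"),
  TPR      = c(0.2477, 0.9331, 0.4993, 0.7416, 0.8260),
  FPR      = c(0.0000, 0.4983, 0.0242, 0.1765, 0.2872),
  reported = c(0.1786, 0.8118, 0.3668, 0.5840, 0.6757)
)

# Total harvest implied by TPR and FPR, weighted by group size
tab$implied <- (n_adult * tab$TPR + n_infant * tab$FPR) / (n_adult + n_infant)

# Vertical distance between the adult and infant harvest curves
tab$TPR_minus_FPR <- tab$TPR - tab$FPR

tab
```

The implied and reported totals agree to the rounding of the inputs, and the `TPR_minus_FPR` column makes it easy to see which rule gives the widest gap between adult and infant harvest rates.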
Essay Question: Based on the ROC curve, it is evident a wide range of possible “cutoffs” exist. Compare and discuss the five cutoffs determined in this assignment.
Answer: The ROC-style curve and AUC value of 0.867 indicate that VOLUME is a strong discriminator between infants and adults, meaning it can be used effectively to design harvesting rules. Each of the five cutoff rules emphasizes a different tradeoff between protecting infants and maximizing the harvest of adults.
1. Protect all infants: This rule is highly conservative: it guarantees no infants are harvested, but only ~25% of adults are available for harvest. While ethically appealing for infant protection, it results in a low yield of adults and may not be practical for large-scale harvesting.
2. Median infant: This rule cuts through the population aggressively, harvesting almost all adults but at the cost of harvesting about half of the infants. The high infant harvest rate makes it unsustainable, but it shows how the two populations overlap in VOLUME distributions.
3. Median adult: This rule provides a more balanced approach: about half of adults are harvested, while only ~2% of infants are mistakenly taken. It is a practical middle ground that offers a moderately reasonable adult yield while keeping infant mortality low.
4. Max smoothed difference: By maximizing the separation between adult and infant harvest rates, this rule provides a biologically meaningful cutoff. It harvests ~74% of adults while only ~18% of infants, striking a compromise between efficiency and conservation.
5. Zero A1 infants: This rule ensures that none of the youngest infants (A1 class) are harvested, but it allows more of the older infant classes to be taken. The result is a relatively high adult harvest (83%) but also a moderate infant harvest (~29%).
Conclusion: The max smoothed difference cutoff stands out as the most effective compromise, offering a high adult harvest with relatively low infant loss. This balance makes it the most promising candidate for practical harvesting guidelines.
Final Essay Question: Assume you are expected to make a presentation of your analysis to the investigators. How would you do so? Consider the following in your answer:
Answer:
1. Instead of making a specific recommendation, I would compare the different cutoff rules generated to explain the tradeoffs between harvesting strategies and lightly suggest the max smoothed difference cutoff. This way, if the investigators have any extra knowledge or requirements that pertain to the findings, such as a yield minimum, they can share them.
2. The analysis shows that VOLUME is useful for discriminating between infants and adults, but the overlap between groups, sampling assumptions, and ecological factors limit the reliability of any single cutoff rule. Harvest guidelines should therefore be interpreted as decision-support tools, not absolute rules, and should be adapted to ecological, ethical, and management realities.
3. I would advise caution about the false positive rate of the cutoff I suggested, which is ~18% of infants harvested.
4. I would suggest further classification of abalones by age so that the cutoff can be made more precise.